Global disease monitoring and forecasting with Wikipedia
Infectious disease is a leading threat to public health, economic stability,
and other key social structures. Efforts to mitigate these impacts depend on
accurate and timely monitoring to measure the risk and progress of disease.
Traditional, biologically-focused monitoring techniques are accurate but costly
and slow; in response, new techniques based on social internet data such as
social media and search queries are emerging. These efforts are promising, but
important challenges in the areas of scientific peer review, breadth of
diseases and countries, and forecasting hamper their operational usefulness.
We examine a freely available, open data source for this use: access logs
from the online encyclopedia Wikipedia. Using linear models, language as a
proxy for location, and a systematic yet simple article selection procedure, we
tested 14 location-disease combinations and demonstrate that these data
feasibly support an approach that overcomes these challenges. Specifically, our
proof-of-concept yields models with r² up to 0.92, forecasting value up to
the 28 days tested, and several pairs of models similar enough to suggest that
transferring models from one location to another without re-training is
feasible.
Based on these preliminary results, we close with a research agenda designed
to overcome these challenges and produce a disease monitoring and forecasting
system that is significantly more effective, robust, and globally comprehensive
than the current state of the art.
Comment: 27 pages; 4 figures; 4 tables. Version 2: Cite McIver & Brownstein and adjust novelty claims accordingly; revise title; various revisions for clarity
Estimating influenza incidence using search query deceptiveness and generalized ridge regression
Seasonal influenza is a sometimes surprisingly impactful disease, causing
thousands of deaths per year along with much additional morbidity. Timely
knowledge of the outbreak state is valuable for managing an effective response.
The current state of the art is to gather this knowledge using in-person
patient contact. While accurate, this is time-consuming and expensive. This has
motivated inquiry into new approaches using internet activity traces, based on
the theory that lay observations of health status lead to informative features
in internet data.
These approaches risk being deceived by activity traces having a
coincidental, rather than informative, relationship to disease incidence; to
our knowledge, this risk has not yet been quantitatively explored. We evaluated
both simulated and real activity traces of varying deceptiveness for influenza
incidence estimation using linear regression.
We found that deceptiveness knowledge does reduce error in such estimates,
that it may help automatically selected features perform as well as or better
than features that require human curation, and that a semantic distance measure
derived from the Wikipedia article category tree serves as a useful proxy for
deceptiveness. This suggests that disease incidence estimation models should
incorporate not only data about how internet features map to incidence but also
additional data to estimate feature deceptiveness. By doing so, we may gain one
more step along the path to accurate, reliable disease incidence estimation
using internet data. This capability would improve public health by decreasing
the cost and increasing the timeliness of such estimates.
Comment: 27 pages, 8 figures
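One reading of deceptiveness-aware generalized ridge regression is to give each feature its own shrinkage penalty, heavier for features believed to be coincidentally (rather than informatively) correlated with incidence. The sketch below uses invented data and a hand-rolled linear solver; the paper's actual estimator and deceptiveness scores differ.

```python
# Deceptiveness-weighted generalized ridge sketch:
# beta = (X'X + diag(penalties))^-1 X'y, one penalty per feature.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

def generalized_ridge(X, y, penalties):
    """Ridge fit with a separate shrinkage penalty for each feature."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
           for i in range(p)]
    for i in range(p):
        XtX[i][i] += penalties[i]
    Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(p)]
    return solve(XtX, Xty)

# Feature 0 truly tracks incidence; feature 1 is coincidental noise.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 5.0]]
y = [1.1, 2.0, 3.2, 3.9, 5.1]
# High deceptiveness -> heavy shrinkage of feature 1's coefficient.
beta = generalized_ridge(X, y, penalties=[0.1, 50.0])
print([round(b, 3) for b in beta])
```

The heavily penalized coefficient is shrunk toward zero while the informative feature's coefficient survives nearly intact.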
Epidemiological data challenges: planning for a more robust future through data standards
Accessible epidemiological data are of great value for emergency preparedness
and response, understanding disease progression through a population, and
building statistical and mechanistic disease models that enable forecasting.
The status quo, however, renders acquiring and using such data difficult in
practice. In many cases, a primary way of obtaining epidemiological data is
through the internet, but the methods by which the data are presented to the
public often differ drastically among institutions. As a result, there is a
strong need for better data sharing practices. This paper identifies, in detail
and with examples, the three key challenges one encounters when attempting to
acquire and use epidemiological data: 1) interfaces, 2) data formatting, and 3)
reporting. These challenges are used to provide suggestions and guidance for
improvement as these systems evolve in the future. If these suggested data and
interface recommendations were adhered to, epidemiological and public health
analysis, modeling, and informatics work would be significantly streamlined,
which can in turn yield better public health decision-making capabilities.
Comment: v2 includes several typo fixes; v3 adds a paragraph on backfill; v4 adds 2 new paragraphs to the conclusion that address Frontiers reviewer comments; v5 adds some minor modifications that address additional reviewer comments
Charliecloud's layer-free, Git-based container build cache
A popular approach to deploying scientific applications in high performance
computing (HPC) is Linux containers, which package an application and all its
dependencies as a single unit. This image is built by interpreting instructions
in a machine-readable recipe, which is faster with a build cache that stores
instruction results for re-use. The standard approach (used e.g. by Docker and
Podman) is a many-layered union filesystem, encoding differences between layers
as tar archives.
Our experiments show that a Git-based cache for layer-free container images
performs similarly to layered caches on both build time and disk usage, with a
considerable advantage for many-instruction recipes. Our approach also has
structural advantages: better diff format, lower
cache overhead, and better file de-duplication. These results show that a
Git-based cache for layer-free container implementations is not only possible
but may outperform the layered approach on important dimensions.
Comment: 12 pages, 12 figures
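The instruction-result caching described above can be illustrated in miniature: key each cache entry on the parent state's identifier plus the instruction text, so rebuilding an unchanged recipe does no work. In this sketch states are plain strings and execution is simulated; Charliecloud's real cache stores states as Git commits.

```python
# Toy instruction-level build cache. Each entry is keyed by
# hash(parent state id, instruction), so an unchanged instruction
# chain is replayed from the cache instead of re-executed.
import hashlib

cache = {}       # key -> resulting state id
executions = []  # record which instructions actually ran

def run_instruction(parent_id, instruction):
    key = hashlib.sha256(f"{parent_id}\x00{instruction}".encode()).hexdigest()
    if key in cache:                   # cache hit: skip the work
        return cache[key]
    executions.append(instruction)     # cache miss: "execute" it
    state_id = hashlib.sha256(f"state:{key}".encode()).hexdigest()
    cache[key] = state_id
    return state_id

def build(recipe):
    state = "base-image"
    for instruction in recipe:
        state = run_instruction(state, instruction)
    return state

recipe = ["RUN apt-get install -y gcc", "COPY . /src", "RUN make -C /src"]
first = build(recipe)    # all three instructions execute
second = build(recipe)   # fully cached rebuild: nothing executes
print(len(executions))   # only the first build did work
```

Because the key chains through the parent state, changing any early instruction automatically invalidates everything downstream of it.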
Forecasting the 2013--2014 Influenza Season using Wikipedia
Infectious diseases are one of the leading causes of morbidity and mortality
around the world; thus, forecasting their impact is crucial for planning an
effective response strategy. According to the Centers for Disease Control and
Prevention (CDC), seasonal influenza affects between 5% and 20% of the U.S.
population and causes major economic impacts resulting from hospitalization and
absenteeism. Understanding influenza dynamics and forecasting its impact is
fundamental for developing prevention and mitigation strategies.
We combine modern data assimilation methods with Wikipedia access logs and
CDC influenza-like illness (ILI) reports to create a weekly forecast for
seasonal influenza. The methods are applied to the 2013--2014 influenza season
but are sufficiently general to forecast any disease outbreak, given incidence
or case count data. We adjust the initialization and parametrization of a
disease model and show that this allows us to determine systematic model bias.
In addition, we provide a way to determine where the model diverges from
observation and evaluate forecast accuracy.
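As a toy stand-in for the assimilation step, the sketch below simulates a discrete-time SIR model and grid-searches the transmission rate to match early synthetic observations. The paper's data assimilation machinery is far more sophisticated; this only illustrates fitting a mechanistic disease model to incidence-like data.

```python
# Fit a discrete-time SIR model's transmission rate to early
# "observations", then run it forward as a season forecast.
# All parameters and data here are synthetic.

def sir(beta, gamma=0.5, s0=0.99, i0=0.01, weeks=20):
    """Weekly new-infection fractions from a discrete-time SIR model."""
    s, i, out = s0, i0, []
    for _ in range(weeks):
        new_inf = beta * s * i
        s, i = s - new_inf, i + new_inf - gamma * i
        out.append(new_inf)
    return out

# Synthetic "observed" first 8 weeks, generated with beta = 1.5.
observed = sir(1.5)[:8]

def sse(beta):
    """Sum of squared errors against the observed early weeks."""
    sim = sir(beta)[:len(observed)]
    return sum((a - b) ** 2 for a, b in zip(sim, observed))

# Crude grid search (a stand-in for the assimilation updates).
best_beta = min((b / 10 for b in range(5, 31)), key=sse)
forecast = sir(best_beta)   # full-season forecast from the fitted model
print(best_beta)            # recovers the generating value, 1.5
```

Real assimilation would update the fit sequentially as each new ILI report arrives, rather than in one batch search.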
Wikipedia article access logs are shown to be highly correlated with
historical ILI records and allow for accurate prediction of ILI data several
weeks before it becomes available. The results show that prior to the peak of
the flu season, our forecasting method projected the actual outcome with a high
probability. However, since our model does not account for re-infection or
multiple strains of influenza, the tail of the epidemic is not predicted well
after the peak of flu season has passed.
Comment: Second version. In the previous version, 2 figure references were compiling incorrectly due to an error in the LaTeX source
A Practical Guide for the Effective Evaluation of Twitter User Geolocation
Geolocating Twitter users---the task of identifying their home
locations---serves a wide range of community and business applications such as
managing natural crises, journalism, and public health. Many approaches have
been proposed for automatically geolocating users based on their tweets; at the
same time, various evaluation metrics have been proposed to measure the
effectiveness of these approaches, making it challenging to understand which of
these metrics is the most suitable for this task. In this paper, we propose a
guide for a standardized evaluation of Twitter user geolocation by analyzing
fifteen models and two baselines in a controlled experimental setting. Models
are evaluated using ten metrics over four geographic granularities. We use rank
correlations to assess the effectiveness of these metrics.
Our results demonstrate that the choice of effectiveness metric can have a
substantial impact on the conclusions drawn from a geolocation system
experiment, potentially leading experimenters to contradictory results about
relative effectiveness. We show that for general evaluations, a range of
performance metrics should be reported, to ensure that a complete picture of
system effectiveness is conveyed. Given the global geographic coverage of this
task, we specifically recommend evaluation at micro versus macro levels to
measure the impact of the bias in distribution over locations. Although many
complex geolocation algorithms have been applied in recent years, a
majority-class baseline is still competitive at coarse geographic granularity. We
propose a suite of statistical analysis tests, based on the employed metric, to
ensure that the results are not coincidental.
Comment: Accepted in the journal ACM Transactions on Social Computing (TSC). Extended version of the ASONAM 2018 short paper. Please cite the TSC/ASONAM version and not the arXiv version
Results from the Centers for Disease Control and Prevention's Predict the 2013-2014 Influenza Season Challenge
Background: Early insights into the timing of the start, peak, and intensity of the influenza season could be useful in planning influenza prevention and control activities. To encourage development and innovation in influenza forecasting, the Centers for Disease Control and Prevention (CDC) organized a challenge to predict the 2013-2014 United States influenza season.
Methods: Challenge contestants were asked to forecast the start, peak, and intensity of the 2013-2014 influenza season at the national level and at any or all Health and Human Services (HHS) region level(s). The challenge ran from December 1, 2013 to March 27, 2014; contestants were required to submit 9 biweekly forecasts at the national level to be eligible. The selection of the winner was based on expert evaluation of the methodology used to make the prediction and on the accuracy of the prediction as judged against the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet).
Results: Nine teams submitted 13 forecasts for all required milestones. The first forecast was due on December 2, 2013; 3/13 forecasts received correctly predicted the start of the influenza season within one week, 1/13 predicted the peak within 1 week, 3/13 predicted the peak ILINet percentage within 1%, and 4/13 predicted the season duration within 1 week. For the prediction due on December 19, 2013, the number of forecasts that correctly predicted the peak week increased to 2/13, the peak percentage to 6/13, and the duration of the season to 6/13. As the season progressed, the forecasts became more stable and closer to the season milestones.
Conclusion: Forecasting has become technically feasible, but further efforts are needed to improve forecast accuracy so that policy makers can reliably use these predictions. CDC and challenge contestants plan to build upon the methods developed during this contest to improve the accuracy of influenza forecasts.